ML Model Deployment
Model Serving
- key components
- a model
- an interpreter
- input data
- important metrics (a measurement sketch follows this list):
- latency = the time it takes from receiving a request to returning the result
- throughput/successful request rate = how many queries are processed within a specific period of time
- cost = the cost of the compute resources (CPU, GPU, etc.) used to serve the model
- minimizing latency and maximizing throughput tends to increase cost; the cost can be kept in check by
- GPU sharing
- multi-model serving
- optimizing models used for inference
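A minimal sketch of how the latency and throughput metrics above can be measured, assuming a dummy `predict()` function as a stand-in for the served model:

```python
import statistics
import time

def predict(x):
    """Placeholder for the served model; pretend inference takes ~10 ms."""
    time.sleep(0.01)
    return x * 2

latencies = []
start = time.perf_counter()
for i in range(200):
    t0 = time.perf_counter()
    predict(i)
    latencies.append(time.perf_counter() - t0)   # latency of one request
elapsed = time.perf_counter() - start

print(f"p50 latency: {statistics.median(latencies) * 1e3:.1f} ms")
print(f"p99 latency: {statistics.quantiles(latencies, n=100)[98] * 1e3:.1f} ms")
print(f"throughput: {len(latencies) / elapsed:.0f} requests/s")
```

In a real service these metrics are measured under concurrent load (e.g. with a load-testing tool), where throughput and latency can diverge instead of being simple inverses as in this sequential loop.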
Deployment procedures
- Deploy in production
- Monitor & maintain system
Deployment approaches
- Cloud-based deployment
- deploy a model with a cloud service provider like AWS, GCP, or Azure. You can use services like AWS Elastic Beanstalk, Google Kubernetes Engine, or Azure Machine Learning to deploy and manage your model.
- Containerization
- containerize a model using tools like Docker and deploy it on a container orchestration platform like Kubernetes.
- Server-based deployment
- deploy a model on a dedicated server or a cluster of servers, e.g. using TensorFlow Serving. The servers run an application that exposes an API endpoint (REST or gRPC), which client applications call to get predictions from the model (a minimal sketch follows this list).
- Serverless deployment
- deploy a model as a serverless function using platforms like AWS Lambda or Google Cloud Functions.
- Mobile deployment
- deploy a model on the mobile device itself, using tools like Core ML for iOS or TensorFlow Lite for Android.
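A minimal sketch of the server-based approach, using FastAPI as one common choice for the REST layer; `model_predict` is a placeholder for a real loaded model:

```python
# app.py -- minimal REST prediction endpoint (run with: uvicorn app:app)
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    features: list[float]

def model_predict(features: list[float]) -> float:
    # In a real service, load a trained model at startup
    # (e.g. model = joblib.load("model.pkl")) and call it here.
    return sum(features)

@app.post("/predict")
def predict(req: PredictRequest):
    return {"prediction": model_predict(req.features)}
```

Clients then POST JSON like `{"features": [1.0, 2.0]}` to `/predict` and receive the prediction in the response.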
Batch Prediction Versus Online Prediction
What are they
- batch prediction = when predictions are generated periodically or whenever triggered. The predictions are stored somewhere, such as in SQL tables or an in-memory database, and retrieved as needed.
- online/real-time prediction = when predictions are generated and returned as soon as requests for these predictions arrive
| | Batch prediction (asynchronous) | Online prediction (synchronous) |
|---|---|---|
| Frequency | Periodical, such as every four hours | As soon as requests come |
| Useful for | Processing accumulated data when you don’t need immediate results (such as recommender systems) | When predictions are needed as soon as a data sample is generated (such as fraud detection) |
| Optimized for | High throughput | Low latency |
| Cost | Pay for resources for the duration of the batch job | Pay for resources while the endpoint is running |
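A minimal sketch contrasting the two modes, with a placeholder model and an in-memory SQLite table standing in for the prediction store:

```python
import sqlite3

def model_predict(features):
    return sum(features)  # placeholder model

# Batch prediction: run periodically (e.g. by a scheduler), store results,
# then serve them later with a cheap lookup.
def run_batch_job(user_features, db):
    db.execute("CREATE TABLE IF NOT EXISTS predictions (user_id TEXT PRIMARY KEY, score REAL)")
    for user_id, feats in user_features.items():
        db.execute("INSERT OR REPLACE INTO predictions VALUES (?, ?)",
                   (user_id, model_predict(feats)))
    db.commit()

def lookup_prediction(user_id, db):
    return db.execute("SELECT score FROM predictions WHERE user_id = ?",
                      (user_id,)).fetchone()[0]

# Online prediction: compute when the request arrives.
def online_predict(features):
    return model_predict(features)

db = sqlite3.connect(":memory:")
run_batch_job({"u1": [1.0, 2.0], "u2": [0.5, 0.5]}, db)
print(lookup_prediction("u1", db))   # precomputed, low-cost lookup
print(online_predict([3.0, 4.0]))    # computed at request time
```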
Unifying Batch Pipeline and Streaming Pipeline
Traditionally, the two pipelines are maintained by two different teams: the ML team maintains the batch pipeline for training, while the deployment team maintains the streaming pipeline for inference. Nowadays, the two pipelines can be unified so that training and inference share the same processing logic, which prevents the two from drifting apart.
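One way to picture this unification is to have both pipelines call the same feature-transformation code. All names below (`featurize`, `score_event`, the feature fields) are illustrative:

```python
def featurize(raw_event: dict) -> list[float]:
    """Single source of truth for feature logic, shared by both pipelines."""
    return [
        float(raw_event["amount"]),
        1.0 if raw_event.get("is_international") else 0.0,
    ]

def build_training_set(historical_events):
    # batch (training) pipeline
    return [featurize(e) for e in historical_events]

def score_event(model, event):
    # streaming (inference) pipeline
    return model.predict([featurize(event)])[0]

class DummyModel:
    def predict(self, batch):
        return [sum(x) for x in batch]

print(build_training_set([{"amount": 10, "is_international": True}]))
print(score_event(DummyModel(), {"amount": 10, "is_international": True}))
```

Because `featurize` is the only place the feature logic lives, a change made for training is automatically picked up at inference time.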
How to accelerate ML model inference
There are three main approaches to reducing a model’s inference latency:
- make it do inference faster: model optimization
- locally
- Vectorization: instead of processing one element at a time, operate on whole arrays of contiguous elements at once (see the sketch below)
- Parallelization: Given an input array (or n-dimensional array), divide it into different, independent work chunks, and do the operation on each chunk individually
- Loop tiling: Change the data accessing order in a loop to leverage hardware’s memory layout and cache
- Operator fusion: Fuse multiple operators into one to avoid redundant memory access
- globally: optimize the entire computation graph end to end
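A small sketch of the vectorization idea with NumPy: the same dot product written as an element-by-element Python loop and as a single array operation:

```python
import time
import numpy as np

x = np.random.rand(1_000_000).astype(np.float32)
w = np.random.rand(1_000_000).astype(np.float32)

def dot_loop(x, w):
    # scalar version: one multiply-add per iteration
    total = 0.0
    for xi, wi in zip(x, w):
        total += xi * wi
    return total

def dot_vectorized(x, w):
    # the same reduction expressed as one vectorized array operation
    return float(np.dot(x, w))

for fn in (dot_loop, dot_vectorized):
    t0 = time.perf_counter()
    fn(x, w)
    print(fn.__name__, f"{time.perf_counter() - t0:.4f} s")
```

The vectorized version works on contiguous memory and can use the hardware's SIMD instructions, which is why it is typically orders of magnitude faster here.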
- make the model smaller: model compression
- low-rank factorization
- replace high-dimensional tensors with lower-dimensional tensors
- knowledge distillation
- a method in which a small model (student) is trained to mimic a larger model or ensemble of models (teacher)
- pruning
- originally for decision trees (Decision Tree & Random Forest#^4c01d0)
- now also means removing redundant or unimportant parameters from neural networks
- quantization
- reduces a model’s size by using fewer bits to represent its parameters, e.g. representing a float number using 16 bits instead of 32 bits
- the most general and most commonly used model compression method (see the sketch below)
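A sketch of the arithmetic behind quantization for a single weight tensor: map 32-bit floats to 8-bit integers with a scale and zero point (a 4x size reduction), then dequantize when the values are needed. Frameworks such as PyTorch and TensorFlow Lite provide this out of the box; this example only illustrates the idea:

```python
import numpy as np

def quantize_int8(w):
    # map the float range [w.min(), w.max()] onto the int8 range [-128, 127]
    scale = (w.max() - w.min()) / 255.0
    zero_point = np.round(-w.min() / scale) - 128
    q = np.clip(np.round(w / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4, 4).astype(np.float32)   # 32-bit weights
q, scale, zp = quantize_int8(w)                # stored in 8 bits each
print(np.abs(w - dequantize(q, scale, zp)).max())  # small reconstruction error
```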
- make the hardware it’s deployed on run faster: ML on the Cloud and on the Edge
- On the cloud = a large chunk of computation is done on the cloud, either public clouds or private clouds
- On the edge = a large chunk of computation is done on consumer devices
- online prediction and on-the-edge deployment require hardware powerful enough to run the model with acceptable latency
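As a sketch of preparing a model for on-the-edge inference (assuming TensorFlow is the framework in use), a small Keras model can be converted to TensorFlow Lite and bundled with the mobile app:

```python
import tensorflow as tf

# Stand-in model; in practice this would be the trained model.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables size/latency optimizations
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
# The .tflite file is shipped with the app and executed by the
# TensorFlow Lite interpreter on the device.
```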